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Abstract 

Background: Inferring gene orders of ancestral genomes has the potential to provide detailed information about 
the recent evolution of species descended from them. Current popular tools to infer ancestral genome data (such 
as GRAPPA and MGR) are all parsimony-based direct optimization methods with the aim to minimize the number 
of evolutionary events. Recently a new method based on the approach of maximum likelihood is proposed. The 
current implementation of these direct optimization methods are all based on solving the median problems and 
achieve more accurate results than the maximum likelihood method. However, both GRAPPA and MGR are 
extremely time consuming under high rearrangement rates. The maximum likelihood method, on the contrary, 
runs much faster with less accurate results. 

Results: We propose a mixture method to optimize the inference of ancestral gene orders. This method first uses 
the maximum likelihood approach to identify gene adjacencies that are likely to be present in the ancestral 
genomes, which are then fixed in the branch-and-bound search of median calculations. This hybrid approach not 
only greatly speeds up the direct optimization methods, but also retains high accuracy even when the genomes 
are evolutionary very distant. 

Conclusions: Our mixture method produces more accurate ancestral genomes compared with the maximum 
likelihood method while the computation time is far less than that of the parsimony-based direct optimization 
methods. It can effectively deal with genome data of relatively high rearrangement rates which is hard for the 
direct optimization methods to solve in a reasonable amount of time, thus extends the range of data that can be 
analyzed by the existing methods. 



Background 

Inferring gene orders and gene content of ancestral gen- 
omes has a wide range of applications. High-level rear- 
rangement events such as inversions, transpositions that 
change gene order are important because they are "rare 
genomic events" [1] and can be used to estimate ances- 
tral genomes and infer the number of steps that evolve 
one genome into another. 

The fundamental question in building phylogenies is 
how far apart two species are from each other. Hannen- 
halli and Penvzner [2] provided the first polynomial 
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algorithm for computing inversion distance, which can 
be used as a representation of the evolution distance 
between species. Then by using a direct optimization 
approach based on median calculation, which is to opti- 
mize each ancestral node in terms of its three or more 
immediate neighbors, the phylogeny and ancestral gen- 
omes could be reconstructed. Current popular methods 
(such as GRAPPA [3], MGR [4] and their new improved 
versions) all use this approach and can infer ancestral 
gene orders with high accuracy. However these methods 
are extremely slow especially when the rearrangement 
rate is high. A new method based on the maximum like- 
lihood approach is recently proposed [5]. In this 
method, the probabilities of all possible ancestral gene 
order are calculated based on the present species' gene 
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order data and the ancestral gene orders are recon- 
structed to maximize the overall probability. This 
method is much faster than the direct optimization 
methods under high evolution rearrangement event rate, 
but the accuracy of the reconstructed gene order is 
lower than that of the direct optimization methods. 

In this paper, we propose a mixture method to 
enhance the inference of ancestral gene orders. We first 
use the maximum likelihood method to reconstruct an 
initial ancestral genome. Then we randomly select a 
number of gene adjacencies from the ancestral genome 
and fix them to run the median calculation. By analyz- 
ing the results of each median calculation, we try to 
infer the correct gene adjacencies generated by the max- 
imum likelihood method. Finally we fix these adjacen- 
cies and perform median calculation to get the ancestral 
genome. We have conducted extensive experiments on 
simulated datasets and the results show that this mix- 
ture approach is more accurate than the maximum like- 
lihood method. Also our method is much faster than 
solely using the median calculation when the rearrange- 
ment rate is high. Using this hybrid approach, for those 
datasets that are previously too difficult for existing 
methods, we will be able to analyze them within a rea- 
sonable time frame with very high accuracy. 

Genome rearrangements 

Given a set S of n genes {1, 2, fz}, a genome can be 
represented by an ordering of these genes. A gene with 
a positive orientation is written as otherwise it is writ- 
ten as a A genome can be linear or circular, A linear 
genome is a permutation on the gene set, while a circu- 
lar genome can be represented in the same way under 
the implicit assumption that the permutation closes 
back on itself. Let G be the genome with signed order- 
ing of 1, 2, n. An inversion (also called reversal) 
between indices / and ; (/ < /), transforms G to a new 
genome with ordering 

1, 2, • • • , I - 1, -j, -Q - 1), • • • , + 1, • • • , n 

There are some additional events for multiple-chro- 
mosome genomes, such as translocation (when the end 
of one chromosome switches with the end of another 
chromosome), 7^55/(9^ (when one chromosome splits to 
form two) ^.nd fusion (when two chromosomes combine 
to become one). 

We use a breakpoint graph [6] to represent the per- 
mutation with respect to the identity permutation. 
Given a genome with permutation tt, let the breakpoint 
graph that corresponds to it be M (tt). The vertex set V 
of M (n) is the collection of {2i - 1, 2i}, and / is any dis- 
tinct gene of permutation n. Two genes / and ; are said 
to be adjacent in genome G if / is immediately followed 



by ; and can be represented by an edge (2/, 2;-l). If / 
and are adjacent, then it can be represented by an 
edge (2/, 2/). The edge set E oi M (n) consists of all the 
adjacencies in tt. For example, for a circular unichromo- 
somal genome G2 = 1,4, 2, -3, 5, the vertex set is V = 
{1, 2, 3, 4, 5, 6, 7, 8, 9,10} and the edge set is E = 
{(10,1), (2, 7), (3, 8), (4, 6), (5, 9)}, as shown in Figure 1. 
The breakpoint graph extends naturally to a multiple 
breakpoint graph (MBG), representing a set of three or 
more genomes. 

Inferring ancestral gene order by median calculation 

One popular approach to reconstruct ancestral genome 
is through direct optimization and the task is to seek 
the tree with the minimum number of events, which is 
in spirit similar to the maximum parsimony approach 
used in sequence phylogeny. The core of these methods 
is to solve the median problem defined as follows: given 
a set of N genomes with permutations {TTi}i<i < ^ and a 
distance measure <i, find a permutation ttm that mini- 
mizes median score defined as Yl^i d(7ii,7iM)' 

Widely used methods lil<e GRAPPA [3] and MGR [7] (as 
well as their new improved versions) both use median cal- 
culation to infer phylogenies and ancestral gene orders. 
For example, GRAPPA examines every possible tree topol- 
ogy and reports the one with the minimum events. For a 
given tree, it iteratively updates the gene order on each 
ancestral node in terms of its three neighbors through 
median calculation until the whole score of the tree 
reaches minimum. At this point the gene order obtained 
on each internal node corresponds to the inferred ances- 
tral genome data and the total number of events is simply 
the summation of pairwise distances along all edges. 

The median problem is known to be NP-hard [6] under 
most rearrangement distances proposed so far. Recently 
Xu proposed the concept of Adequate Subgraph [8] and 
applied this theory to create the ASMedian solver, which 
is the fastest median solver to date. The Multiple Break- 
point Graph (MBG) is used to model the median pro- 
blem and an Adequate Subgraph is defined as a subgraph 
that has number of cycles larger or equal to 3m/2 (m is 
half the number of the vertices of the subgraph). Ade- 
quate subgraphs allow a decomposition of an MBG into 
smaller, more easily solved graphs. The median solution 
is the combination of the median solutions for the ade- 
quate subgraphs and the remaining MBG, so we can 
shrink the known median edges of the adequate subgraph 
and reduces the original MBG size. 

Figure 2 shows an example of adequate subgraph 
decomposition. Given three circular genomes {1, 2, 3}, 
{1, 2, -3} and {1, 3, 2}, the MBG of these three genomes 
are shown on the left. Vertices (2, 3) composes an ade- 
quate subgraph according to the definition of adequate 
subgraph, so we can shrink (2, 3) from the MBG and 
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get a reduced MBG as shown on the right. The final 
median solution is also shown in the figure and we can 
see that the edges of the median solution are within the 
reduced MBG. 

By searching the existing adequate subgraphs up to a 
certain size and incorporate them into an exhaustive algo- 
rithm for the median problem, the ASMedian solver could 
significantly reduce the computation time of median cal- 
culation, though it still runs very slowly with high gene 
rearrangement rate. Given n the number of genes in the 
dataset and r the expected number of events per edge, 
when the ratio of r/n is larger than 25%, all direct optimi- 
zation methods have great difficulty in finishing the analy- 
sis even after months of computation. As ASMedian (as 
well as other median algorithms) use a branch-and-bound 
approach, its performance relies on how to quickly prune 
branches. If we can fix some adjacencies before perform- 
ing median calculation, the search space will be further 
decreased and the problem would be solved much faster. 

Reconstruct the ancestral genomes based on maximum 
likelihood 

Another new approach to reconstruct the ancestral gen- 
ome is based on maximum likelihood. Ma proposed a 



probabilistic framework for inferring ancestral genomic 
orders with rearrangements including inversion, translo- 
cation and fusion [5]. When the phylogeny tree is given, 
the tree is re-rooted so that the ancestral node to be 
inferred becomes the root of the new tree. In this way, 
all the leave nodes can be taken into account for recon- 
structing the target ancestral node. 

Because the gene order of any ancestral node is 
unknown, the probabilities of all possible gene adjacen- 
cies within the genome need to be calculated. Since 
each internal node of the phylogeny tree has two chil- 
dren and we assume that both children evolved from 
the parent node, the posterior probability of any gene 
adjacency within the parent node can be computed 
when treating the gene orders of the children as 
observed data. The calculation can be performed 
recursively from the bottom leaves level to the top 
level until the root is reached. At this point the prob- 
abilities of all the possible gene adjacencies within the 
target ancestral genome are computed. Finally an 
approximate algorithm is used to pick up and connect 
the adjacencies together to maximize the overall prob- 
ability, and these selected adjacencies form the final 
inferred ancestral genome. 




(a) 




(b) 



Figure 2 An example of adequate subgraph, (a) An MBG based on three circular genomes {1, 2, 3} (thick lines), {1, 2, -3} (double lines), {1, 3, 
2} (thin lines), (b) The reduced MBG with (2 3) shrunk because (2, 3) is an adequate sub-graph. The dash line indicates the median solution 1, 2, 
3, which is the combination of the median solution of the two sub-graphs. 
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Methods 

We propose a framework that combines both maximum 
UkeUhood and direct optimization approaches to infer 
ancestral gene orders. The maximum Ukelihood method 
runs much faster with high genome rearrangement rate 
compared with the direct optimization methods, but the 
latter can infer more accurate gene orders. To speed up 
the latter, we need to reduce the computation time for 
the median calculation as it is clearly the bottleneck of 
the computation. Our method tries to pick up the cor- 
rectly inferred gene order from the maximum likelihood 
method and fix these adjacencies in the median calcula- 
tion. As a result, the median search space can be signifi- 
cantly reduced and the median calculation is 
accomplished much faster while the accuracy of the 
inferred gene order is also improved. 

Reduce median search space by fixing adjacencies 

We use Xu's ASMedian solver which is based on the 
Adequate Subgraph theory [8]. In order to speed up the 
median calculation, we have to further reduce the med- 
ian search space. Given permutation {tti}i<i<^, let the 
MBG that corresponds to it be M (tt). If we assume an 
adjacency (/, ;) should be present in the final median 
solution, we could force {/, to become an adequate 
subgraph of M (tt). The procedure is as follows: For 
each permutation tt, if edge (/, is already present in 
the breakpoint graph, we do nothing. Otherwise let (/, 
a), (/, b) be the edges corresponding to vertices / and y. 
We then remove both edges a), (/, b) and create two 
new edges {a, b), (/, j). Figure 3 shows the procedure 
stated above. Three permutations are represented by 
thin lines, double lines and thick lines respectively. For 
the permutation of thick lines, edge (/, ;) already exists, 
so nothing needs to be done. For the other two permu- 
tations, edge j) is not present. For the permutation 
represented by double line, we remove the original 
edges (c, i) and (<i, ;) which are incident to vertices /, ; 
and add an edge (/, ;) with double line. For the 



permutation represented by thin line, we remove the 
original edges {a^ i) and (Z?, ;) which are incident to ver- 
tices ; and add an edge (/, ;) with thin line. According 
to the definition of adequate subgraph, vertices {/, 
become an adequate subgraph because three edges are 
incident to these two vertices [8]. 

If we can fix a number of adjacencies, an equal num- 
ber of adequate subgraphs can be created. Thus we can 
shrink these subgraphs from the original multiple break 
point graph, and the original median problem will be 
greatly reduced. If the adjacencies that we try to fix are 
the actual adjacencies present in the final solution, this 
fixing procedure will not affect the accuracy of the med- 
ian solution. 

Infer the correct adjacencies 

Based on our observation and simulation, only a part of 
the adjacencies inferred by the maximum likelihood 
method are the correct adjacencies that are present in 
the ancestral genome. It will be desired if we can opti- 
mally pick out all the correct adjacencies and fix them 
for the median computation. However, it is very difficult 
to find out all the correct adjacencies inferred by the 
current maximum likelihood method as only a small 
percentage of adjacencies are reported with high prob- 
ability while others usually have similar small probability 
which cannot be used to judge whether an adjacency is 
correct or not. Our method to identify the correct adja- 
cencies is a randomized approach. First, we will keep 
adjacencies that have higher probabilities as "correct". 
Then we will randomly select a number of adjacencies 
from those with lower probability and fix them in the 
median computation using ASMedian. These randomly 
selected adjacencies may contain both correct and incor- 
rect adjacencies, but the correct adjacencies are more 
likely to lead the inference of other correct adjacencies 
within the ancestral genome after the median computa- 
tion. Thus after the median is found, we examine the 
non-fixed adjacencies in the "ancestral genome" 




o- 






Figure 3 Fix an adjacency in the MBG. On the left is the MBG with three permutations represented by thin lines, double lines and thick 
lines respectively. On the right is the MBG with Oy) fixed for all the permutations. 
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generated by the median solver and record the ones that 
also appear in the "ancestral genome" generated by the 
maximum likelihood method. We repeat the procedure 
many times. Finally for each adjacency generated by the 
maximum likelihood method, we get the number of 
times it appears in the resulted median genomes. The 
more it appears, the more likely it is a correct adjacency. 

The number of randomly selected adjacencies should 
not be too large. Too many fixed adjacencies may con- 
tain too many incorrect adjacencies to pollute the med- 
ian calculation and leads to bad results. On the other 
hand, the number of fixed adjacencies should not be too 
small, otherwise the time consumption of median calcu- 
lation is unacceptable since the search space is still too 
large. In our practice, we randomly choose 70-75% of 
the adjacencies and fix them in the median computa- 
tions. We also conducted extensive testing to verify that 
70-75% is indeed a good choice and the results are 
shown later. 

When more than three genomes are involved and a 
phylogeny is given, we will first infer part of the correct 
adjacencies for each ancestral genome that corresponds 
to an internal node. We will then use our modified 
median solver to iteratively update the gene order of 
these internal nodes with those fixed adjacencies. The 
tree edge length is the genomic distance between the 
nodes and the score of the tree is defined as the sum of 
all the tree edge. The iteration stops when the score of 
the tree dose not change, suggesting that convergence is 
reached. At this time, gene orders contained in the 
internal nodes are the final ancestral genomes to be 
reported. 

Results and discussion 

Experimental methods 

We use simulated data to assess the quality of our 
method as we know the "true" ancestors in simulations. 
In our experiments, we use two models to generate the 
data. First we generate model tree topologies from the 
uniformly distributed binary trees, each with 10 leaves 
indicating 10 genomes. We use different evolutionary 
rates, of 20%, 24% ... 36% expected rearrangement rates 
per tree edge, with 50% relative flucturation. We also 
use the model proposed by Lin and Moret [9] to gener- 
ate another set of random trees with 10 leaves. The tree 
diameters range from 2.0 to 4.0. The Lin and Moret's 
tree model has more unbalanced edge lengths and is 
suggested as more biologically realistic. 

Once the simulated genomes are generated, we use 
our mixture method to infer the ancestral gene order 
data. Because we are handling genomes with equal gene 
contents, the accuracy can be measured with the num- 
ber of correctly inferred adjacencies (the inferred adja- 
cencies that also appear in the original ancestral 



genome) divided by the total number of adjacencies for 
each genome. 

Number of replicates required 

As discussed before, we use maximum likelihood method 
to reconstruct the initial ancestral genome and then ran- 
domly select a number of gene adjacencies from the 
initial ancestral gene order data and fix them for median 
computation. This procedure is repeated until each adja- 
cency has been selected for enough times. Finally we 
record the number of occurrences of each gene adjacency 
that are present in the median results (excluding the case 
when it is fixed in median computation) and sort these 
adjacencies according to the number of occurrences. If 
the number of repeats is large enough, the sorting of the 
gene adjacencies will remain almost unchanged. 

According to our test, when the number of replicates 
is over 100, the sorting of the gene adjacencies comes to 
a steady state and there is little change with more 
repeated tests are conducted. Since we will pick a cer- 
tain percentage of adjacencies with higher numbers of 
occurrences, the very small change in adjacency sorting 
will hardly affect the result. 

Adjacency selection percentage 

When we sort the gene adjacencies by the number of 
their occurrences in the median calculation result, it is 
clear that the more an adjacency appears, the more it is 
likely to be a correct adjacency. Thus we can just pick 
the adjacencies with large numbers of occurrences and 
fix them in the median calculation. However, we need 
to determine the threshold of how many adjacencies 
should be fixed. We define the selection percentage as 
the number of selected adjacencies divided by the total 
number of adjacencies. From our experience, we found 
that the percentage cannot be lower than 65%, otherwise 
the median calculation are still very slow as the search 
space is not small enough. 

To determine the selection percentage, we use various 
percentages, from 65% to 85%, to select the gene adjacen- 
cies and fix them to reconstruct ancestral genomes 
through median computation. We compare the average 
accuracy of the reconstructed gene order under different 
selection percentages. Table 1 shows the average accu- 
racy of reconstructed gene order under different selection 
percentages, using uniform trees with various evolution- 
ary rates per edge, while Table 2 shows the results of Lin 
and Moret's tree model with various tree diameters. 

From these tables, we can see that when the rearran- 
gement rate or the tree diameter is relatively small 
(meaning genomes are closely related), the gene accu- 
racy reaches maximum values when the selection per- 
centage is 70%. When the rearrangment rate or the tree 
diameter grows larger, it would be better to use 75% as 
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Table 1 Accuracy of reconstructed gene order (%) under various selection percentages and rearrangement rates with 
uniformly distributed tree model 

65% selected 70% selected 75% selected 80% selected 85% selected 

20% events per edge 89.73 91.93 90.47 89.53 89.24 

24% events per edge 80.28 81.92 81.07 80.56 79.81 

28% events per edge 71 .59 73.1 2 73.28 71.10 71 .63 

32% events per edge 63.33 64.52 65.72 65.25 64.81 

36% events per edge 59.45 60.87 62.43 60.27 60.15 



Table 2 Accuracy of reconstructed gene order (%) under various selection percentages and diameters with Lin's 
random tree model 
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80.96 


79.37 
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the selection percentage. This is because with larger 
rearrangement rates, some correct adjacencies inferred 
by the maximum likelihood method are hard to be 
inferred by the median calculation, so we should use a 
larger selection percentage to include more adjacencies 
inferred by the maximum likelihood method. 

Accuracy on closely related genomes 

First we compare the accuracy of the ancestral genome 
reconstructed by our method with the result from both 
Ma's maximum likelihood approach and the pure ASMe- 
dian solver. Because ASMedian solver is very time con- 
suming, we only perform tests under lower rearrangement 
rates and diameters. We perform simulation tests using 
the two types of tree models as stated above with various 
rearrangement rates and tree diameters. For each config- 
uration we run 50 data sets and the results are averaged. 
Table 3 shows the comparison under uniform distribution 
binary trees with average rearrangement rates ranging 
form 12% to 20% and Table 4 shows the comparison 
under Lin's random trees with diameters ranging form 0.6 
to 1.2. As mentioned before, we can see that the median 
method produces more accurate results than those of the 
maximum likelihood method under all situations. Also we 
notice that our mixture method could produce a similar 
or even slightly better result than solely using ASMedian. 



These tables also show that the randomized approach 
could effectively infer correct adjacencies from the maxi- 
mum likelihood method and leads to equivalent accuracy 
of the best median solver. 

Accuracy on distant genomes 

ASMedian solver cannot finish datasets with higher 
rates or diameters, thus for distant genomes, we only 
compare the accuracy of the ancestral genome recon- 
structed by our method with the result of Ma's maxi- 
mum likelihood approach. To see the effectiveness of 
our new approach, we also compare the optimal solu- 
tion when all correct adjacencies in the maximum likeli- 
hood results can be told, which is impossible in real 
data analysis but is trivial in simulation as the truth is 
known. This optimal solution is also the upper bound 
that our mixture framework can achieve. 

Figure 4 shows the comparison under uniform distri- 
bution binary trees with different rearrangement rates 
per edge. We can see that our method can infer ances- 
tral gene order more accurately than Ma's method. At 
the rearrangement rate of 20% we get an average 
improvement of nearly 12%. With the increase of rear- 
rangment rates, the improvement decreases, but we still 
have about 4% in adjacency accuracy comparison when 
the rearrangement rate reaches 36%. Compared with the 



Table 3 Reconstructed gene order accuracy comparison with low rearrangement rates under uniformly distributed 
tree model 
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Table 4 Reconstructed gene order accuracy comparison 
with small diameters under Lin's random tree model 





Ma's 
method 


Pure median 
method 


Our mixture 
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optimal median result, the accuracy of our method are 
about 10% less at the rearrangement rate of 20%. The 
difference decreases when the arrangement increases, 
and we can see that the result of our method is still 
close to the optimal result. 

Figure 5 shows the comparison under random trees with 
different diameters. Because of more unbalanced edge 
lengths which decrease the accuracy of the gene order 
inferred by the median calculation, the improvement of 
the adjacency accuracy is not as much as the uniform tree 
model. But still we have an average improvement of about 
7% with diameter equals 2.0. With the increase of dia- 
meter, the improvement decreases and finally we still have 
more than 3% when diameter equals 4.0. Compared with 
the optimal result, the accuracy of our method are about 
5% less when the diameter is 2.0. The difference decreases 
when the diameter increases, and we can see that under 



random tree models with very unbalanced edge lengths, 
the difference between our result and the optimal result 
are much smaller than that of the uniform distributed tree 
model. 

In conclusion, our method that randomly picks adja- 
cencies from the maximum likelihood method could 
achieve accuracy that is comparable to the optimal 
result by picking all the correct adjacencies. For distant 
genomes, the original median-based parsimony methods 
cannot give result in reasonable time and we can expect 
that the result of parsimony method will not exceed the 
optimal result achieved by fixing part of the correct 
adjacencies. As a result, we assume that our method 
could achieve similar accuracy for distant genomes com- 
pared with the parsimony methods. Also the result 
shows that, although the accuracy decreases when gen- 
omes are more distant from each other, our method are 
still more accurate than the original maximum likeli- 
hood method. However, for very distant genomes (rear- 
rangement rate over 40% or diameter over 4.5), the 
maximum likelihood method can only produce result 
with very low accuracy, resulting not enough correct 
adjacencies for our mixture method. If we keep the ori- 
ginal selection percentage (70% to 75%), the accuracy of 
the inferred ancestral genome becomes very low. On the 
other hand, if we use lower selection percentages, it will 
be very difficult for the median calculation to finish in 
reasonable time as not so many adjacencies are fixed. 
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Figure 4 Comparison of the reconstructed gene adjacency accuracy with high rearrangement rates under uniform distribution binary 
trees. 
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Figure 5 Comparison of the reconstructed gene adjacency accuracy with high rearrangement rates under Lin's random tree model. 



Thus inferring the ancestral genome for very distant 
genomes is still a challenging problem for our method. 

Time consumption 

We also compare the time consumption of our mixture 
method and the original parsimony method. The time 
cost of our mixture method includes the time spent on 
the maximum likelihood method and the time for med- 
ian calculations with multiple fixed adjacencies. As sta- 
ted above, operations for calculating median with 
randomly selected adjacencies are independent from 
each other, so they can be performed in parallel. The 
test is performed on an server with 2.4GHz, 4 core 
CPUs. Average running time of the two methods is 
reported in seconds. 

Table 5 shows the comparison under uniform distri- 
bution binary trees with different rearrangement rates 
per edge. Table 6 shows the comparison under random 
trees with different diameters. The speedups range from 
1.8 to over 139 or even more, increasing along as the 
rearrangement rates or tree diameters increase. Also we 



can see that for distant genomes with rearrangement 
rate over 24% or diameter over 2.5, the original parsi- 
mony method based on ASMedian solver cannot finish 
the computation within reasonable time while our 
method can still accomplish computation in relatively 
short time. As shown in this table, our mixture method 
achieves significant speedups compared with the original 
parsimony method. 

Conclusions 

Ancestral gene order data can be inferred by either med- 
ian computations or the maximum likelihood method. 
Median calculation can give a more accurate result but is 
extremely time consuming under high rearrangement 
rates. Maximum likelihood method runs much faster but 
the inferred gene order is not as accurate as the result of 
median method. We propose a mixture method to infer 
the ancestral gene order based on the above two 
approaches to improve both time cost and accuracy. The 
main idea of our method is to pick the correct gene adja- 
cencies through a randomized procedure and reduce the 



Table 5 Comparison of average time cost under uniform Table 6 Comparison of average time cost under Lin's 



distribution binary trees random tree model 



Rearrangement rate per edge 


16% 


20% 


24% 


28% 


32% 


Tree diameter 


1.5 


2.0 


2.5 


3.0 


3.5 


Pure median method 


315 


15800 


1.4 X 10^ 


N/A 


N/A 


Pure median method 


8500 


76400 


2 X 10^ 


N/A 


N/A 


Our mixture metliod 


180 


760 


1270 


1620 


1850 


Our mixture method 


430 


840 


1430 


1920 


2230 


Speed up 


1.75 


20.78 


110.23 


N/A 


N/A 


Speed up 


19.76 


90.95 


139.86 


N/A 


N/A 
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median computation cost by fixing these correct gene 
adjacencies. Experiments show that our mixture method 
produces more accurate ancestral genomes than the 
maximum Ukelihood method while the computation time 
is far less than that of pure median method. 



Acknowledgements 

This work is partly supported by grar^t NSF 0904179. 

This article has beer^ published as part of BMC Genomics Volume 13 

Supplement 1, 2012: Selected articles from the Ter^th Asia Pacific 

Bioinformatics Cor^ferer^ce (APBC 2012). The full conter^ts of the supplemer^t 

are available or^lir^e at http://www.biomedcer^tral.com/1471-2164/13? 

issue=Sl. 



Author details 

^Departmer^t of Computer Scier^ce and Engineering, University of South 
Carolina, Columbia, SC 29208, USA. ^Center for Computational Biology, 
Beijing Forestry University, Beijing 100083, China. 

Authors' contributions 

YZ and JT contributed to the development and implementation of the 
algorithms, and FH were in charge of conducting simulations. 

Competing interests 

The authors declare that they have no competing interests. 



Published: 17 January 2012 



References 



Rokas A, Holland P: Rare genomic changes as a tool for phylogenetics. 

Trends Ecol Evol 2000, 15:454-459. 

Hannenhalli S, Pevzner P: Transforming cabbage into turnip (polynomial 
algorithm for sorting signed permutations by reversals). Proc 27th Ann 
Symp Theory of Computing 1995, 178-189. 

Moret BM, Wyman S, Bader DA, Warnow T, Yan M: A new implementation 
and detailed study of breakpoint analysis. Pac Symp Biocomput 2001, 
583-594. 

Bourque G, Pevzner PA: Genome-scale evolution: reconstructing gene 

orders in the ancestral species. Genome Res 2002, 12:26-36. 

Ma J: A probabilistic framework for inferring ancestral genomic orders. 

IEEE International Conference on Bioinformatics and Biomedicine 2010, 
179-184. 

Caprara A: The reversal median problem. INFORMS J Comput 2003, 
15(1):93-113. 

Bourque G, Pevzner PA: Genome-scale evolution: reconstructing gene 

orders in the ancestral species. Genome Res 2002, 12:26-36. 

Xu A, Sankoff D: Decompositions of multiple breakpoint graphs and 

rapid exact solutions to the median problem. Proc 8th International 

Workshop on Algorithms in Bioinformatics (WABI'08) 2008, 25-37. 

Lin Y, Rajan V, Swenson KM, Moret BM: Estimating true evolutionary 

distances under rearrangements, duplications, and losses. BMC 

Bioinformatics 2010, ll(Suppl 1):S54. 



doi:l 0.1 1 86/1 471 -21 64-1 3-Sl -S7 

Cite this article as: Zhang et al.: A mixture framework for inferring 
ancestral gene orders. BMC Genomics 2012, 13(Suppl 1):S7. 



Submit your next manuscript to BioMed Central 
and take full advantage of: 

• Convenient online submission 

• Thorough peer review 

• No space constraints or color figure charges 

• Immediate publication on acceptance 

• Inclusion in PubMed, CAS, Scopus and Google Scholar 

• Research which is freely available for redistribution 



Submit your manuscript at 
www.biomedcentral.com/submit 



o 



BioMed Central 



